
    Generating Literal and Implied Subquestions to Fact-check Complex Claims

    Verifying complex political claims is a challenging task, especially when politicians use various tactics to subtly misrepresent the facts. Automatic fact-checking systems fall short here, and their predictions, such as "half-true", are not very useful in isolation, since we have no idea which parts of the claim are true and which are not. In this work, we focus on decomposing a complex claim into a comprehensive set of yes-no subquestions whose answers influence the veracity of the claim. We present ClaimDecomp, a dataset of decompositions for over 1000 claims. Given a claim and its verification paragraph written by fact-checkers, our trained annotators write subquestions covering both explicit propositions of the original claim and its implicit facets, such as asking about additional political context that changes our view of the claim's veracity. We study whether state-of-the-art models can generate such subquestions, showing that these models generate reasonable questions to ask, but predicting the comprehensive set of subquestions from the original claim alone, without evidence, remains challenging. We further show that these subquestions can help identify relevant evidence to fact-check the full claim and derive the veracity through their answers, suggesting that they can be useful pieces of a fact-checking pipeline.
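
    As an illustration of the decomposition step described above, here is a minimal Python sketch assuming a generic LLM completion callback; the prompt wording and the output parsing are hypothetical, not the paper's actual setup:

        from typing import Callable, List

        PROMPT = (
            "Decompose the following political claim into yes-no subquestions "
            "whose answers bear on its veracity. Cover both the explicit "
            "propositions of the claim and its implicit facets, such as "
            "missing political context.\n"
            "Claim: {claim}\n"
            "Subquestions (one per line):"
        )

        def decompose_claim(claim: str, complete: Callable[[str], str]) -> List[str]:
            """Prompt an LLM for subquestions; keep lines that end in '?'."""
            raw = complete(PROMPT.format(claim=claim))
            return [ln.strip(" -\t") for ln in raw.splitlines()
                    if ln.strip().endswith("?")]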

    Using Natural Language Explanations to Rescale Human Judgments

    The rise of large language models (LLMs) has brought a critical need for high-quality human-labeled data, particularly for processes like human feedback and evaluation. A common practice is to label data via consensus annotation over crowdworker judgments. However, annotators' judgments for subjective tasks can differ in many ways: they may have different qualitative judgments about an example, and they may map those judgments to a labeling scheme in different ways. We show that these nuances can be captured by natural language explanations, and propose a method to rescale ordinal annotations and explanations using LLMs. Specifically, we feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. These scores should reflect the annotators' underlying assessments of the example. The rubric can be designed or modified after annotation and can include distinctions that may not have been known when the original error taxonomy was devised. We explore our technique in the context of rating system outputs for a document-grounded question answering task, where LLMs achieve near-human performance. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric. Comment: Data available at https://github.com/ManyaWadhwa/explanation_based_rescaling
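
    The rescaling step could look roughly like the following sketch, again assuming a generic LLM completion callback; the prompt and the 0-100 scale are illustrative assumptions, not the paper's exact design:

        from typing import Callable

        def rescale_judgment(rating: int, explanation: str, rubric: str,
                             complete: Callable[[str], str]) -> float:
            """Map a Likert rating plus its free-text explanation onto a
            rubric-anchored numeric score via an LLM."""
            prompt = (
                f"Scoring rubric:\n{rubric}\n\n"
                f"An annotator gave a rating of {rating} with this "
                f"explanation:\n{explanation}\n\n"
                "On a 0-100 scale anchored in the rubric above, which score "
                "best reflects the annotator's underlying assessment? "
                "Answer with a number only."
            )
            return float(complete(prompt).strip())

    Because the score is anchored in the rubric rather than in each annotator's personal scale, ratings from different annotators become directly comparable after rescaling.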

    How to Evaluate Semantic Communications for Images with ViTScore Metric?

    Semantic communications (SC) are expected to be a new paradigm that catalyzes next-generation communication, shifting the main concern from accurate bit transmission to effective exchange of semantic information. However, the metrics previously and widely used for images are not applicable to evaluating image semantic similarity in SC. Classical metrics for measuring the similarity between two images, such as PSNR and MS-SSIM, usually operate at the pixel or structural level. Straightforwardly applying tailored deep-learning metrics from the CV community, such as LPIPS, is likewise infeasible for SC. To tackle this, inspired by BERTScore from the NLP community, we propose a novel metric for evaluating image semantic similarity, named Vision Transformer Score (ViTScore). We prove theoretically that ViTScore has three important properties, namely symmetry, boundedness, and normalization, which make ViTScore convenient and intuitive for image measurement. To evaluate the performance of ViTScore, we compare it with three typical metrics (PSNR, MS-SSIM, and LPIPS) across five classes of experiments. Experimental results demonstrate that ViTScore evaluates image semantic similarity better than the other three metrics, indicating that ViTScore is an effective performance metric when deployed in SC scenarios.
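
    The BERTScore-style construction suggests roughly the following computation: embed each image into a set of ViT patch embeddings, match patches across the two images by cosine similarity in both directions, and combine the two directions into an F1. Below is a minimal sketch assuming an upstream step that yields L2-normalized patch embeddings; the paper's exact definition of ViTScore may differ in detail:

        import torch

        def vitscore(x: torch.Tensor, y: torch.Tensor) -> float:
            """BERTScore-style F1 over two images' ViT patch embeddings.
            x, y: (num_patches, dim) tensors with L2-normalized rows."""
            sim = x @ y.T                              # pairwise cosine similarity
            recall = sim.max(dim=1).values.mean()      # best match for each x patch
            precision = sim.max(dim=0).values.mean()   # best match for each y patch
            return (2 * precision * recall / (precision + recall)).item()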

    Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction

    Robustness to distribution changes ensures that NLP models can be successfully applied in the real world, especially for information extraction tasks. However, most prior evaluation benchmarks have been devoted to validating pairwise matching correctness, ignoring the crucial measurement of robustness. In this paper, we present the first benchmark that simulates the evaluation of open information extraction models in the real world, where the syntactic and expressive distributions under the same knowledge meaning may drift in various ways. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique consisting of sentences with structured knowledge of the same meaning but different syntactic and expressive forms. By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate over the whole clique. We perform experiments on typical models published in the last decade as well as on a popular large language model; the results show that existing successful models exhibit a frustrating degradation, with a maximum drop of 23.43 in F1 score. Our resources and code are available at https://github.com/qijimrc/ROBUST. Comment: Accepted by EMNLP 2023 Main Conference
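
    One natural instantiation of "consistently accurate over the whole clique" is worst-case aggregation: score each clique by its weakest member, then average across cliques. The sketch below is a hypothetical simplification for illustration, not necessarily the paper's exact metric:

        from typing import Dict, List

        def clique_robust_score(f1: Dict[str, float],
                                cliques: List[List[str]]) -> float:
            """Average, over cliques, of each clique's worst-member F1, so a
            model scores well only if it is accurate on every paraphrase of
            the same underlying knowledge."""
            worst = [min(f1[s] for s in clique) for clique in cliques]
            return sum(worst) / len(worst)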

    VisKoP: Visual Knowledge oriented Programming for Interactive Knowledge Base Question Answering

    We present the Visual Knowledge oriented Programming platform (VisKoP), a knowledge base question answering (KBQA) system that integrates humans into the loop to edit and debug knowledge base (KB) queries. VisKoP not only provides a neural program induction module, which converts natural language questions into the knowledge oriented programming language (KoPL), but also maps KoPL programs onto graphical elements. KoPL programs can be edited with simple graphical operators, such as dragging to add knowledge operators and slot filling to designate operator arguments. Moreover, VisKoP provides auto-completion for the knowledge base schema, and users can easily debug a KoPL program by checking its intermediate results. To facilitate practical KBQA on a million-entity-level KB, we design a highly efficient KoPL execution engine for the back end. Experimental results show that VisKoP is highly efficient and that user interaction can fix a large portion of wrong KoPL programs to arrive at the correct answer. The VisKoP online demo (https://demoviskop.xlore.cn, stable release of this paper; https://viskop.xlore.cn, beta release with new features), the highly efficient KoPL engine (https://pypi.org/project/kopl-engine), and a screencast video (https://youtu.be/zAbJtxFPTXo) are now publicly available.
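
    To make the KoPL idea concrete, here is a toy interpreter that executes a linear sequence of knowledge operators and records every intermediate result, which is the property that enables the step-by-step debugging described above. The operator names and mini-KB are hypothetical and do not reflect the real kopl-engine API:

        # Hypothetical mini-KB and operators, for illustration only.
        KB = {"Canada": {"capital": "Ottawa"}, "France": {"capital": "Paris"}}

        def run_kopl(program):
            """Execute [(operator, argument), ...] left to right, keeping a
            trace of intermediate results for inspection."""
            result, trace = None, []
            for op, arg in program:
                if op == "Find":
                    result = KB[arg]          # locate an entity
                elif op == "QueryAttr":
                    result = result[arg]      # read one attribute of it
                trace.append((op, arg, result))
            return result, trace

        answer, steps = run_kopl([("Find", "France"), ("QueryAttr", "capital")])
        # answer == "Paris"; `steps` exposes each intermediate result,
        # which is what a user would inspect when debugging a wrong program.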

    Annual report 1984-1985

    BACKGROUND: HOTAIR, a newly discovered long intergenic noncoding RNA (lincRNA), has been reported to be aberrantly expressed in many types of cancers. This meta-analysis summarizes its potential role as a biomarker in malignancy. METHODS: A quantitative meta-analysis was performed through a systematic search in PubMed, MEDLINE and Web of Science for eligible papers on the prognostic impact of HOTAIR in cancer from inception to Feb. 28, 2014. Pooled hazard ratios (HRs) with 95% confidence intervals (95% CIs) were calculated to summarize the effect. RESULTS: Nineteen studies were included, with a total of 2033 patients. A significant association was observed between high HOTAIR expression and poor overall survival (OS) in patients with cancer (pooled HR 2.22, 95% CI: 1.68-2.93). Place of residence (Asian or Western countries), type of cancer (digestive or non-digestive disease), sample size (more or less than 100), and paper quality (score more or less than 85%) did not alter the significant predictive value of HOTAIR for OS in various kinds of cancer, but preoperative status did. By combining HRs from Cox multivariate analyses, we found that HOTAIR expression was an independent prognostic factor for cancer patients (pooled HR 2.26, 95% CI: 1.62-3.15). Subgroup analysis showed that HOTAIR abundance was an independent prognostic factor for cancer metastasis (HR 3.90, 95% CI: 2.25-6.74). For esophageal carcinoma, high HOTAIR expression was significantly associated with TNM stage (III/IV vs. I/II: OR 6.90, 95% CI: 2.81-16.9) without heterogeneity. In gastric cancer, HOTAIR expression was found to be significantly associated with lymph node metastases (present vs. absent: OR 4.47, 95% CI: 1.88-10.63) and vessel invasion (positive vs. negative: OR 2.88, 95% CI: 1.38-6.04) without obvious heterogeneity. CONCLUSIONS: HOTAIR abundance may serve as a novel predictive factor for poor prognosis in different types of cancers in both Asian and Western countries.
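
    Pooled hazard ratios of this kind are conventionally obtained by inverse-variance weighting on the log scale. The sketch below shows the standard fixed-effect computation, recovering each study's standard error from its reported 95% CI; it is illustrative of the general technique, not the authors' exact procedure:

        import math

        def pooled_hr(studies):
            """Fixed-effect inverse-variance pooling of hazard ratios.
            studies: [(hr, ci_low, ci_high), ...]. The SE of each log-HR is
            recovered from the 95% CI width: (ln(hi) - ln(lo)) / (2 * 1.96)."""
            num = den = 0.0
            for hr, lo, hi in studies:
                se = (math.log(hi) - math.log(lo)) / (2 * 1.96)
                w = 1.0 / se ** 2                  # inverse-variance weight
                num += w * math.log(hr)
                den += w
            log_hr, se_pooled = num / den, math.sqrt(1.0 / den)
            ci = (math.exp(log_hr - 1.96 * se_pooled),
                  math.exp(log_hr + 1.96 * se_pooled))
            return math.exp(log_hr), ci

        # Example with made-up inputs: pooled_hr([(2.1, 1.5, 2.9), (2.4, 1.6, 3.6)])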